
Added Prometheus client to get model server metrics #64


Merged

Conversation

aish1331
Contributor

@aish1331 aish1331 commented May 1, 2025

Issue: #17

Description

Added a Prometheus client that connects to a Prometheus instance to fetch metrics for the benchmark. This assumes Prometheus is already scraping model server metrics.

Engine tested: vLLM

Metrics fetched from vLLM server:

Queue Size
TTFT
TPOT
Input Token Count
Output Token Count
Successful Request Count
Request Latency
If metrics_client is not passed in config.yml, report generation falls back to benchmark request metrics.
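
For reference, a minimal sketch of how such a client can pull one of these metrics over the Prometheus HTTP API. The URL, metric name, and helper function are illustrative assumptions (vLLM metric names vary by version), not the exact implementation in this PR.

import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus address

def query_rate(metric: str, window_seconds: int, end_time: float) -> float:
    """Evaluate rate(metric[window]) at end_time via the Prometheus HTTP API."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": f"rate({metric}[{window_seconds}s])", "time": end_time},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Example (metric name depends on the vLLM version):
# tokens_per_sec = query_rate("vllm:generation_tokens_total", 300, time.time())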

TODO
Fetch KV cache usage metric from vLLM server via Prometheus.

The following are the logs with the report generated for a 2-stage benchmark run.

As of this PR, metrics for the complete run are reported, i.e. metrics for all the stages are aggregated.

Using configuration from: config.yml
Stage 0 - run started
Stage 0 - run completed
Stage 1 - run started
Stage 1 - run completed
Waiting for 20 seconds for Prometheus to collect metrics...


Generating Report ..
--------------------------------------------------
Request Summary
--------------------------------------------------
prompt_tokens_per_second: 85.52936651769419
output_tokens_per_second: 8.609513454470223
requests_per_second: 0.46737358752838354
avg_request_latency: 0.0852452470969997
median_request_latency: 0.12763349153101444
p90_request_latency: 0.13852914329618216
p99_request_latency: 0.14108252413570882
avg_time_to_first_token: 0.0
median_time_to_first_token: 0.0
p90_time_to_first_token: 0.0
p99_time_to_first_token: 0.0
avg_time_per_output_token: 0.0
median_time_per_output_token: 0.0
p90_time_per_output_token: 0.0
p99_time_per_output_token: 0.0
total_requests: 19
avg_prompt_tokens: 183
avg_output_tokens: 18
avg_queue_length: 0
--------------------------------------------------
Metrics Client Summary
--------------------------------------------------
prompt_tokens_per_second: 94.227051529898
output_tokens_per_second: 9.866008932737817
requests_per_second: 0.49996666888874075
avg_request_latency: 0.08623366355895996
median_request_latency: 0.15
p90_request_latency: 0.27
p99_request_latency: 0.297
avg_time_to_first_token: 0.006058311462402343
median_time_to_first_token: 0.007321428571428572
p90_time_to_first_token: 0.009464285714285715
p99_time_to_first_token: 0.009946428571428571
avg_time_per_output_token: 0.004279663193029653
median_time_per_output_token: 0.005
p90_time_per_output_token: 0.009
p99_time_per_output_token: 0.0099
total_requests: 20
avg_prompt_tokens: 115
avg_output_tokens: 11
avg_queue_length: 0

Linter and other Validation Checks

Output of make validate command:

Creating virtual environment if it doesn't exist...
Activating virtual environment and installing dependencies...
.venv/bin/pip install --upgrade pip
Requirement already satisfied: pip in ./.venv/lib/python3.12/site-packages (25.1.1)
.venv/bin/pip install -e .
Obtaining file:///home/ubuntu/forks/clone/inference-perf
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
...
Successfully installed inference-perf-0.0.1
Formatting Python files with ruff format...
.venv/bin/ruff format
27 files left unchanged
Linting Python files with ruff check...
.venv/bin/ruff check
All checks passed!
Running type checking with mypy...
.venv/bin/mypy --strict ./inference_perf
Success: no issues found in 25 source files

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 1, 2025
@k8s-ci-robot
Contributor

Welcome @aish1331!

It looks like this is your first PR to kubernetes-sigs/inference-perf 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/inference-perf has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 1, 2025
@achandrasekar
Contributor

/assign @SachinVarghese


@SachinVarghese SachinVarghese left a comment


Looks great! Minor changes requested

metrics_client: MetricsClient | None = None
if config.metrics_client:
    if config.metrics_client.type == MetricsClientType.PROMETHEUS:
        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
Contributor


This will simplify the client code

Suggested change
- metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
+ metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client.prometheus)
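
The suggestion above implies splitting the metrics client config so that the Prometheus client only receives its own sub-config. A hypothetical sketch of what that shape could look like (field names are assumptions for illustration, not the PR's actual schema):

from typing import Optional
from pydantic import BaseModel

class PrometheusClientConfig(BaseModel):
    url: str                    # Prometheus server URL
    scrape_interval: int = 15   # seconds; assumed default

class MetricsClientConfig(BaseModel):
    type: str                   # e.g. "prometheus"
    prometheus: Optional[PrometheusClientConfig] = None

# With this shape the runner can pass config.metrics_client.prometheus directly,
# so PrometheusMetricsClient never needs to know about other client types.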

Contributor Author


done

# assumption: all metrics clients have metrics exported in Prometheus format
raise NotImplementedError

@abstractmethod
Contributor


methods repeated?

time_per_request=3,
)
)
print("Processing request - " + str(payload.data))
Contributor


we can keep the sleep statement here to mock the processing

Contributor


And let's retain the behaviour of collecting request metrics in this client.

Contributor Author


done
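
For context, a rough sketch (under assumptions) of how the mock model server client could both keep the sleep to simulate processing and keep collecting request metrics. The prompt_tokens / output_tokens / time_per_request field names mirror the report code in this PR; the class and method shapes here are hypothetical.

import asyncio
from typing import Any

class MockModelServerClient:
    def __init__(self) -> None:
        self.request_metrics: list[dict[str, Any]] = []

    async def process_request(self, payload: Any) -> None:
        print("Processing request - " + str(payload.data))
        await asyncio.sleep(3)  # mock the processing time
        # record per-request metrics so get_request_metrics() still works
        self.request_metrics.append(
            {"prompt_tokens": 10, "output_tokens": 10, "time_per_request": 3}
        )

    def get_request_metrics(self) -> list[dict[str, Any]]:
        return self.request_metrics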

config.yml Outdated
@@ -11,3 +11,8 @@ tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
data:
  type: shareGPT
# metrics_client:
Contributor


Let's uncomment these and let the test runner fall back in case the Prometheus server is unreachable.

Contributor Author


Done. The test runner prints the request summary, but it also prints all the error logs for connection failures for each query.

Comment on lines 30 to 81
request_metrics = runtime_parameters.model_server_client.get_request_metrics()
if len(request_metrics) > 0:
    total_prompt_tokens = sum([x.prompt_tokens for x in request_metrics])
    total_output_tokens = sum([x.output_tokens for x in request_metrics])
    if runtime_parameters.duration > 0:
        prompt_tokens_per_second = total_prompt_tokens / runtime_parameters.duration
        output_tokens_per_second = total_output_tokens / runtime_parameters.duration
        requests_per_second = len(request_metrics) / runtime_parameters.duration
    else:
        prompt_tokens_per_second = 0.0
        output_tokens_per_second = 0.0
        requests_per_second = 0.0

    request_summary = ModelServerMetrics(
        total_requests=len(request_metrics),
        requests_per_second=requests_per_second,
        prompt_tokens_per_second=prompt_tokens_per_second,
        output_tokens_per_second=output_tokens_per_second,
        avg_prompt_tokens=int(statistics.mean([x.prompt_tokens for x in request_metrics])),
        avg_output_tokens=int(statistics.mean([x.output_tokens for x in request_metrics])),
        avg_request_latency=statistics.mean([x.time_per_request for x in request_metrics]),
        median_request_latency=statistics.median([x.time_per_request for x in request_metrics]),
        p90_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=10)[8],
        p99_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=100)[98],
        avg_time_to_first_token=0.0,
        median_time_to_first_token=0.0,
        p90_time_to_first_token=0.0,
        p99_time_to_first_token=0.0,
        median_time_per_output_token=0.0,
        p90_time_per_output_token=0.0,
        p99_time_per_output_token=0.0,
        avg_time_per_output_token=0.0,
        avg_queue_length=0,
    )
    print("-" * 50)
    print("Request Summary")
    print("-" * 50)
    for field_name, value in request_summary:
        print(f"{field_name}: {value}")
else:
    print("Report generation failed - no request metrics collected")

if self.metrics_client is not None:
    print("-" * 50)
    print("Metrics Client Summary")
    print("-" * 50)
    metric_client_summary = self.metrics_client.collect_model_server_metrics(runtime_parameters)
    if metric_client_summary is not None:
        for field_name, value in metric_client_summary:
            print(f"{field_name}: {value}")
    else:
        print("Report generation failed - no metrics client summary collected")
Contributor


Can we move the calculation-relevant parts of the code here to the base class so that they can be reused by various report types?
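
For illustration, a hedged sketch of the kind of base-class helper this suggests: the mean/percentile calculations live in one shared method and each report type only handles presentation. Field names follow the diff above; the class and method names are hypothetical.

import statistics
from typing import Any, Dict, List

class ReportGenerator:
    def summarize_requests(self, request_metrics: List[Any], duration: float) -> Dict[str, Any]:
        # Shared calculation logic reusable by different report types.
        latencies = [x.time_per_request for x in request_metrics]
        return {
            "total_requests": len(request_metrics),
            "requests_per_second": len(request_metrics) / duration if duration > 0 else 0.0,
            "avg_prompt_tokens": int(statistics.mean([x.prompt_tokens for x in request_metrics])),
            "avg_output_tokens": int(statistics.mean([x.output_tokens for x in request_metrics])),
            "avg_request_latency": statistics.mean(latencies),
            "median_request_latency": statistics.median(latencies),
            "p90_request_latency": statistics.quantiles(latencies, n=10)[8],
            "p99_request_latency": statistics.quantiles(latencies, n=100)[98],
        }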

Contributor Author


done

@aish1331 aish1331 force-pushed the prometheus-metrics-client branch from 802ed3d to bcf2ee6 on May 5, 2025 21:21

linux-foundation-easycla bot commented May 5, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 5, 2025
@aish1331 aish1331 force-pushed the prometheus-metrics-client branch from db44434 to 034cdba on May 5, 2025 22:20
@achandrasekar
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 5, 2025
# Wait for the metrics to be ready
metrics_client.wait()

end_time = time.time()
Contributor


We shouldn't include the metrics wait time above in the total test run time if all the requests have finished already, since waiting for the metrics scrape interval can skew the test results.

Contributor


Yes, the duration of the test should be calculated correctly. Also consider that this could be a multi-stage run, so durations need to be calculated for each stage.

Contributor Author


Good point!
To tackle that, I am now calculating the perf run duration separately here and have moved the wait logic into the Prometheus client. This duration will be used for computing the default request summary.

I now pass the benchmark start time to the metrics client in the perf runtime params. We call the wait function and then calculate the end time. This way the Prometheus query runs with start timestamp = benchmark start time and end timestamp = benchmark end + wait(), ensuring we get all the metrics.
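
A minimal sketch of that timing split, under the assumption that wait() blocks until Prometheus has scraped the last samples; the names here are illustrative, not the PR's exact attributes.

import time
from typing import Callable, Iterable, Tuple

def run_benchmark(stages: Iterable[Callable[[], None]], metrics_wait: Callable[[], None]) -> Tuple[float, float, float]:
    start_time = time.time()
    for run_stage in stages:
        run_stage()                               # drive one benchmark stage
    perf_duration = time.time() - start_time      # excludes the metrics wait; used for the request summary

    metrics_wait()                                # e.g. metrics_client.wait(): let Prometheus scrape the last samples
    query_end_time = time.time()                  # Prometheus queries span [start_time, query_end_time]
    return start_time, perf_duration, query_end_time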

@achandrasekar
Contributor

Also, if you can run make validate and post the lint check output here showing that it is successful, that would be great. We are trying to get this to run for all PRs, but it might take a day or two before we can enable that.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2025
@SachinVarghese
Contributor

@aish1331 can you rebase, address pending comments, and create any follow-up issues so that we can get this work in?

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2025
@aish1331
Contributor Author

aish1331 commented May 8, 2025

@Bslabe123 FYI I am creating separate report files for the request summary and the summary generated by the metrics client.

@aish1331
Contributor Author

aish1331 commented May 8, 2025

Added an issue to modify reportgen for multi-stage runs: #70

@aish1331 aish1331 requested a review from SachinVarghese May 8, 2025 07:25

@SachinVarghese SachinVarghese left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 8, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aish1331, SachinVarghese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2450773 into kubernetes-sigs:main May 8, 2025
2 checks passed